The educational dataset was explored, and a classification machine learning model was built by me..While doing the project, I progressed task by task and it was explained what I needed to do in each task.
In the first part, I explored different tables using functions such as (head, shape, info, and describe). I dealt with duplicate records and missing values, and performed some exploratory tasks.
In the second part, I built a classification model and reported its accuracy.
I you worked on a public educational dataset shared by UK Open University. Although throughout this document you will be provided with sufficient information about this public dataset, you are strongly recommended to refer to https://analyse.kmi.open.ac.uk/open_dataset for more detailed information.
There is also Kaggle page where you can see some analysis performed shared by other publicly. I think they can be also helpful if you want to explore the dataset beyond this assignment.
https://www.kaggle.com/datasets/rocki37/open-university-learning-analytics-dataset
Courses table (courses.csv) contains the list of all available modules and their presentations.
The columns are:
The structure of B and J presentations may differ and therefore it is good practice to analyse the B and J presentations separately. Nevertheless, for some presentations the corresponding previous B/J presentation do not exist and therefore the J presentation must be used to inform the B presentation or vice versa. In the dataset this is the case of CCC, EEE and GGG modules.
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import math
import seaborn as sns
from IPython.display import Image
import plotly.io as pio
TASK1: Identify and treat duplicate/missing values (if there is any).
courses_data = pd.read_csv("courses.csv")
print(courses_data.head())
code_module code_presentation module_presentation_length 0 AAA 2013J 268 1 AAA 2014J 269 2 BBB 2013J 268 3 BBB 2014J 262 4 BBB 2013B 240
courses_data.isnull().sum()
code_module 0 code_presentation 0 module_presentation_length 0 dtype: int64
courses_data don't have na values
courses_data.duplicated().sum()
0
courses_data don't include duplicate values
TASK2: Find out how many courses started in February vs October, and compare their length. Interpret the results.
courses_data['start_month'] = courses_data['code_presentation'].str[-1]
feb_courses = courses_data[courses_data['start_month'] == 'B']
oct_courses = courses_data[courses_data['start_month'] == 'J']
total_feb_courses = feb_courses['module_presentation_length'].sum()
total_oct_courses = oct_courses['module_presentation_length'].sum()
avg_length_feb = feb_courses['module_presentation_length'].mean()
avg_length_oct = oct_courses['module_presentation_length'].mean()
print(f"Number of courses starting in February: {len(feb_courses)},their total length: {total_feb_courses}")
print(f"Number of courses starting in October: {len(oct_courses)},their total length: {total_oct_courses}")
print(f"Average length of courses starting in February: {avg_length_feb:.2f} days")
print(f"Average length of courses starting in October: {avg_length_oct:.2f} days")
Number of courses starting in February: 9,their total length: 2159 Number of courses starting in October: 13,their total length: 3463 Average length of courses starting in February: 239.89 days Average length of courses starting in October: 266.38 days
according to meaning average presentations october longer than february
StudentInfo (studentInfo.csv) file contains demographic information about the students together with their final result. File contains the following columns:
studentInfo_data = pd.read_csv("studentInfo.csv")
print(studentInfo_data.head())
code_module code_presentation id_student gender region \
0 AAA 2013J 11391 M East Anglian Region
1 AAA 2013J 28400 F Scotland
2 AAA 2013J 30268 F North Western Region
3 AAA 2013J 31604 F South East Region
4 AAA 2013J 32885 F West Midlands Region
highest_education imd_band age_band num_of_prev_attempts \
0 HE Qualification 90-100% 55<= 0
1 HE Qualification 20-30% 35-55 0
2 A Level or Equivalent 30-40% 35-55 0
3 A Level or Equivalent 50-60% 35-55 0
4 Lower Than A Level 50-60% 0-35 0
studied_credits disability final_result
0 240 N Pass
1 60 N Pass
2 60 Y Withdrawn
3 60 N Pass
4 60 N Pass
TASK1: Identify and treat duplicate/missing values (if there is any)
studentInfo_data.duplicated().sum()
0
there are no duplicated values
studentInfo_data.isnull().sum()
code_module 0 code_presentation 0 id_student 0 gender 0 region 0 highest_education 0 imd_band 1111 age_band 0 num_of_prev_attempts 0 studied_credits 0 disability 0 final_result 0 dtype: int64
studentInfo_data.shape
(32593, 12)
data have 1111 missing values in imd_band,removing imd_band columns would result in a loss of information for our analysis. The data has 32593 rows, which is much larger than 1111. For this reason, we will remove the rows containing missing values.
studentInfo_data.dropna(inplace=True)
studentInfo_data.isnull().sum()
code_module 0 code_presentation 0 id_student 0 gender 0 region 0 highest_education 0 imd_band 0 age_band 0 num_of_prev_attempts 0 studied_credits 0 disability 0 final_result 0 dtype: int64
studentInfo_data.shape
(31482, 12)
this is new shape of this data
TASK2: Treating categorical variables.
For this table, besides fixing any potential issue about duplicate/missing values, you are expected to explore the categorical variables: such as highest_education, imd_band, and age_band.
In particular, you may want to check if some categories of highest_education, imd_band, age_band variables (e.g., Post Graduate Qualification in highest_education) contain few instances. In such cases, you may need to merge the minority categories with a major category and even dedice to create a new set of (fewer) categories based on the existing ones. In some cases, you may even want to decide the reduce the number of categories (if you think they are many).
As long as you provide the rationale, you can decide such details by yourself. You should work on AT LEAST TWO categorical variables in this task.
studentInfo_data.describe(include='all').loc[['unique'],:]
| code_module | code_presentation | id_student | gender | region | highest_education | imd_band | age_band | num_of_prev_attempts | studied_credits | disability | final_result | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| unique | 7 | 4 | NaN | 2 | 13 | 5 | 10 | 3 | NaN | NaN | 2 | 4 |
the categorical variables are code_module,code_presentation,gender,region, highest_education, imd_band, age_band, disability, final_result
highest_education
px.bar(studentInfo_data['highest_education'].value_counts())
No formal quals and post graduate qualification are very low compared to the others and they can be misleading so I will combine them with A level or equivalent.
studentInfo_data['highest_education'] = studentInfo_data['highest_education'].replace({
'Post Graduate Qualification':'A Level or Equivalent',
'No Formal quals':'A Level or Equivalent'
})
px.bar(studentInfo_data['highest_education'].value_counts())
imdb_band
px.bar(studentInfo_data['imd_band'].value_counts())
The percent sign for the 10-20 bar is not placed
studentInfo_data['imd_band'] = studentInfo_data['imd_band'].replace({'10-20':'10-20%'})
px.bar(studentInfo_data['imd_band'].value_counts())
The values of consecutive data are very close to each other, so by combining consecutive data, I will reduce the number of categories to 5, that is, each of them will represent 20 percent.
studentInfo_data['imd_band'] = studentInfo_data['imd_band'].replace({
'0-10%':"0-20%",'10-20%':"0-20%",
'20-30%':"20-40%",'30-40%':"20-40%",
'40-50%':"40-60%",'50-60%':"40-60%",
'60-70%':"60-80%",'70-80%':"60-80%",
'80-90%':"80-100%",'90-100%':"80-100%"
})
px.bar(studentInfo_data['imd_band'].value_counts())
age_band
px.bar(studentInfo_data['age_band'].value_counts())
The rate of people over 55 years old is very low when compared to other age groups. age_band is divided into two categories: those over 35 years of age and those younger than 35 years of age.
studentInfo_data['age_band'] = studentInfo_data['age_band'].replace({
'55<=':'35<=',
'35-55':"35<=",
'0-35':"35>"
})
px.bar(studentInfo_data['age_band'].value_counts())
TASK3: Demographic Information and Performance Levels
More importantly for this table you are expected to explore various relationships between final_result and at least three categorical variable (e.g., did students with HE qualification perform better, did students with low imd_band withdraw more often, or did geographic region play any role? etc.). For this purpose you can visualize data and compute some basic statistics.
You must use at least two different chart types (e.g., bar or line or pie) to illustrate how the success/failure rates differ between different categories (e.g., education level, regions, imd_band, age, etc.). At least in one case, the chart should also denote the gender to illustrate the possible interaction between gender and the other categorical variable (e.g., do european females perform better than asian males -just an example :)).
px.sunburst(
data_frame=studentInfo_data,
path=['gender', 'highest_education', 'age_band'],
values=studentInfo_data.final_result.factorize()[0],
width=1000
)
When you look at the graphs, many comments can be made, for example value of HE Qulification of men with all age groups higher than the women
sns.catplot(data=studentInfo_data.loc[:,['gender', 'age_band', 'region',
'final_result']], x='final_result',
col='gender', row='region', hue='age_band',
hue_order=['35>', '35<='], kind='count')
C:\Users\sidar\anaconda3\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight
<seaborn.axisgrid.FacetGrid at 0x1aa30407b90>
When you look at the graphs, many comments can be made, for example, in the Scottish region, men over 35 are more successful than women over 35.
Registration table (studentRegistration.csv) contains information about the time when the student registered for the module presentation. For students who unregistered the date of unregistration is also recorded. File contains five columns:
final_result column in the studentInfo.csv file.TASK1: As the first task, you need to ensure that there are no conflicts between studentRegistration.csv and studentInfo.csv dataset in terms of Withdrawal status of unregistered students. For example, if a student unregistered from a course at some point (which can be found in "studentRegistration.csv"), his/her final_result should be Withdrawal.
regdata = pd.read_csv("studentRegistration.csv")
print(regdata.head())
code_module code_presentation id_student date_registration \ 0 AAA 2013J 11391 -159.0 1 AAA 2013J 28400 -53.0 2 AAA 2013J 30268 -92.0 3 AAA 2013J 31604 -52.0 4 AAA 2013J 32885 -176.0 date_unregistration 0 NaN 1 NaN 2 12.0 3 NaN 4 NaN
studentInfo_data.head()
| code_module | code_presentation | id_student | gender | region | highest_education | imd_band | age_band | num_of_prev_attempts | studied_credits | disability | final_result | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAA | 2013J | 11391 | M | East Anglian Region | HE Qualification | 80-100% | 35<= | 0 | 240 | N | Pass |
| 1 | AAA | 2013J | 28400 | F | Scotland | HE Qualification | 20-40% | 35<= | 0 | 60 | N | Pass |
| 2 | AAA | 2013J | 30268 | F | North Western Region | A Level or Equivalent | 20-40% | 35<= | 0 | 60 | Y | Withdrawn |
| 3 | AAA | 2013J | 31604 | F | South East Region | A Level or Equivalent | 40-60% | 35<= | 0 | 60 | N | Pass |
| 4 | AAA | 2013J | 32885 | F | West Midlands Region | Lower Than A Level | 40-60% | 35> | 0 | 60 | N | Pass |
studentInfo_data = studentInfo_data.merge(regdata.loc[:
,['id_student','code_module','date_unregistration']], how='inner',
on=['id_student','code_module'])
studentInfo_data['final_result'] = np.where(~studentInfo_data['date_unregistration'].isnull(), 'Withdrawn', studentInfo_data['final_result'])
print(studentInfo_data.head())
code_module code_presentation id_student gender region \
0 AAA 2013J 11391 M East Anglian Region
1 AAA 2013J 28400 F Scotland
2 AAA 2013J 30268 F North Western Region
3 AAA 2013J 31604 F South East Region
4 AAA 2013J 32885 F West Midlands Region
highest_education imd_band age_band num_of_prev_attempts \
0 HE Qualification 80-100% 35<= 0
1 HE Qualification 20-40% 35<= 0
2 A Level or Equivalent 20-40% 35<= 0
3 A Level or Equivalent 40-60% 35<= 0
4 Lower Than A Level 40-60% 35> 0
studied_credits disability final_result date_unregistration
0 240 N Pass NaN
1 60 N Pass NaN
2 60 Y Withdrawn 12.0
3 60 N Pass NaN
4 60 N Pass NaN
TASK2: Categorize students based on the day they registered for a course. In other words, you need to bin the registration data based on the date_registration column. Just to illustrate this idea, you can group students into categories such as "Very early birds", "early birds", "in-time", and "late-comers". You can use the categories given in this example or create your own categories.
birds = ["Very early birds", "Early birds", "in-time", "late-comers"]
regdata.date_registration = pd.qcut(regdata.date_registration, 4, labels=birds)
studentInfo_data = studentInfo_data.merge(regdata.loc[:
,['id_student','code_module','date_registration']], how='inner',
on=['id_student','code_module'])
studentInfo_data.head()
| code_module | code_presentation | id_student | gender | region | highest_education | imd_band | age_band | num_of_prev_attempts | studied_credits | disability | final_result | date_unregistration | date_registration | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAA | 2013J | 11391 | M | East Anglian Region | HE Qualification | 80-100% | 35<= | 0 | 240 | N | Pass | NaN | Very early birds |
| 1 | AAA | 2013J | 28400 | F | Scotland | HE Qualification | 20-40% | 35<= | 0 | 60 | N | Pass | NaN | in-time |
| 2 | AAA | 2013J | 30268 | F | North Western Region | A Level or Equivalent | 20-40% | 35<= | 0 | 60 | Y | Withdrawn | 12.0 | Early birds |
| 3 | AAA | 2013J | 31604 | F | South East Region | A Level or Equivalent | 40-60% | 35<= | 0 | 60 | N | Pass | NaN | in-time |
| 4 | AAA | 2013J | 32885 | F | West Midlands Region | Lower Than A Level | 40-60% | 35> | 0 | 60 | N | Pass | NaN | Very early birds |
TASK3: Categorize students based on the day they unregistered a course. In other words, you need to bin registration date based on the date_unregistration column. You are free to determine the number and the name of the categories (as in Task1).
unreg = studentInfo_data.dropna(subset=['date_unregistration'])
unreg.loc[unreg.date_unregistration < 0] = math.nan
categ = ["first_half", "second_half"]
new_group = pd.cut(unreg.date_unregistration, 2, labels=categ)
unreg = unreg.assign(date_unregistration = new_group)
studentInfo_data = studentInfo_data.assign(date_unregistration = pd.cut(studentInfo_data.date_unregistration, 2,
labels=["first half", "second half"]))
px.bar(unreg['date_unregistration'].value_counts())
there is a huge differences between first half and second half
TASK4: Choose THREE variables from demographic data (studentInfo.csv), and explore if there is some relationship between students' registration/unregistration behaviour and the chosen demographic variables (e.g., did students from HE registered early? did male students unregistered sooner than female students?). You are free in exploring the data to answer similar questions that you determine. If you find no relationship, this is totally fine. Just remember that your analysis should be accompanied with meaningful interpretations.
unreg.head()
| code_module | code_presentation | id_student | gender | region | highest_education | imd_band | age_band | num_of_prev_attempts | studied_credits | disability | final_result | date_unregistration | date_registration | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | AAA | 2013J | 30268.0 | F | North Western Region | A Level or Equivalent | 20-40% | 35<= | 0.0 | 60.0 | Y | Withdrawn | first_half | Early birds |
| 14 | AAA | 2013J | 65002.0 | F | East Anglian Region | A Level or Equivalent | 60-80% | 35> | 0.0 | 60.0 | N | Withdrawn | first_half | Very early birds |
| 15 | AAA | 2013J | 65002.0 | F | East Anglian Region | A Level or Equivalent | 60-80% | 35> | 0.0 | 60.0 | N | Withdrawn | first_half | Very early birds |
| 18 | AAA | 2014J | 65002.0 | F | East Anglian Region | A Level or Equivalent | 60-80% | 35> | 1.0 | 60.0 | N | Withdrawn | first_half | Very early birds |
| 19 | AAA | 2014J | 65002.0 | F | East Anglian Region | A Level or Equivalent | 60-80% | 35> | 1.0 | 60.0 | N | Withdrawn | first_half | Very early birds |
sns.catplot(data=studentInfo_data, x='highest_education', row='gender', kind="count",
hue='age_band', col='date_unregistration')
C:\Users\sidar\anaconda3\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight
<seaborn.axisgrid.FacetGrid at 0x1aa31eef810>
When we look at these plots, many comments can be made. For example, the rate of unregistration is higher in men than in women, or the rate of people over 35 is higher than people under 35.
Course components table (moodle.csv) contains information about the available materials in the Moodle LMS. Typically these are html pages, pdf files, etc. Students have access to these materials online and their interactions with the materials are recorded. The moodle.csv file contains the following columns:
TASK1: In this dataset, some columns contain mainly missing values. Detect them and drop them to save space in the memory.
course_comp = pd.read_csv("moodle.csv")
print(course_comp.head())
id_site code_module code_presentation activity_type week_from week_to 0 546943 AAA 2013J resource NaN NaN 1 546712 AAA 2013J oucontent NaN NaN 2 546998 AAA 2013J resource NaN NaN 3 546888 AAA 2013J url NaN NaN 4 547035 AAA 2013J resource NaN NaN
course_comp.isna().sum()
id_site 0 code_module 0 code_presentation 0 activity_type 0 week_from 5243 week_to 5243 dtype: int64
week_from and week_to have 5243 missing values
course_comp.dropna(subset=['week_from', 'week_to'], inplace=True)
course_comp.isna().sum()
id_site 0 code_module 0 code_presentation 0 activity_type 0 week_from 0 week_to 0 dtype: int64
course_comp.head()
| id_site | code_module | code_presentation | activity_type | week_from | week_to | |
|---|---|---|---|---|---|---|
| 113 | 546732 | AAA | 2013J | oucontent | 2.0 | 2.0 |
| 198 | 546719 | AAA | 2013J | oucontent | 1.0 | 1.0 |
| 210 | 546681 | AAA | 2013J | oucontent | 1.0 | 1.0 |
| 264 | 877040 | AAA | 2014J | oucontent | 2.0 | 2.0 |
| 323 | 877045 | AAA | 2014J | oucontent | 1.0 | 1.0 |
TASK2: First identify the top 5 popular course component (activity_type) across all courses. Then, create a new table that displays how many times each of these popular components were included in each offering (code_presentation) of each course (code_module). Briefly interpret this table.
course_comp.activity_type.value_counts()
activity_type oucontent 381 url 352 resource 148 subpage 135 questionnaire 42 page 33 quiz 13 ouwiki 6 dualpane 5 repeatactivity 4 dataplus 2 Name: count, dtype: int64
ouconcet, url, resource, subpage, questionnaire are most popular activity type
popular_ac = ["oucontent",
"url",
"resource",
"subpage",
"questionnaire"]
filtered_course_comp = course_comp[course_comp['activity_type'].isin(popular_ac)]
component_counts = filtered_course_comp.groupby(['code_module', 'code_presentation', 'activity_type']).size().unstack(fill_value=0)
print(component_counts.head())
activity_type oucontent questionnaire resource subpage \
code_module code_presentation
AAA 2013J 3 0 0 0
2014J 3 0 0 0
BBB 2013J 1 0 40 24
2014B 1 0 40 24
2014J 24 0 8 0
activity_type url
code_module code_presentation
AAA 2013J 0
2014J 0
BBB 2013J 11
2014B 11
2014J 1
component_counts = component_counts.reset_index()
melted_data = pd.melt(component_counts, id_vars=['code_module', 'code_presentation'], var_name='activity_type', value_name='count')
sns.catplot(x='activity_type', y='count', hue='code_presentation', col='code_module', kind='bar', data=melted_data)
C:\Users\sidar\anaconda3\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight
<seaborn.axisgrid.FacetGrid at 0x1aa35637d90>
Activity types generally appear in FFF and BBB
Student activity data (studentMoodleInteract.csv) contains information about each student’s interactions with the materials in the VLE. This file contains the following columns:
studac_data = pd.read_csv("studentMoodleInteract.csv")
studac_data.head()
| code_module | code_presentation | id_student | id_site | date | sum_click | |
|---|---|---|---|---|---|---|
| 0 | AAA | 2013J | 28400 | 546652 | -10 | 4 |
| 1 | AAA | 2013J | 28400 | 546652 | -10 | 1 |
| 2 | AAA | 2013J | 28400 | 546652 | -10 | 1 |
| 3 | AAA | 2013J | 28400 | 546614 | -10 | 11 |
| 4 | AAA | 2013J | 28400 | 546714 | -10 | 1 |
TASK1: Display the total number of clicks for each course per each semester delivered. Besides a textual output, some visualizations must be provided for helping to interpret the data.
clicks = studac_data.groupby(['code_module', 'code_presentation'])['sum_click'].sum()
clicks = clicks.fillna(0)
print(clicks)
code_module code_presentation
AAA 2013J 648494
2014J 598158
BBB 2013B 1347911
2013J 1378656
2014B 833865
2014J 1673744
CCC 2014B 1889177
2014J 2792972
DDD 2013B 1387467
2013J 1757214
2014B 947657
2014J 1437751
EEE 2013J 1535953
2014B 832038
2014J 1616236
FFF 2013B 4220055
2013J 5116439
2014B 2975619
2014J 5281809
GGG 2013J 509091
2014B 425165
2014J 399628
Name: sum_click, dtype: int64
clicks_df = clicks.reset_index()
clicks_df
| code_module | code_presentation | sum_click | |
|---|---|---|---|
| 0 | AAA | 2013J | 648494 |
| 1 | AAA | 2014J | 598158 |
| 2 | BBB | 2013B | 1347911 |
| 3 | BBB | 2013J | 1378656 |
| 4 | BBB | 2014B | 833865 |
| 5 | BBB | 2014J | 1673744 |
| 6 | CCC | 2014B | 1889177 |
| 7 | CCC | 2014J | 2792972 |
| 8 | DDD | 2013B | 1387467 |
| 9 | DDD | 2013J | 1757214 |
| 10 | DDD | 2014B | 947657 |
| 11 | DDD | 2014J | 1437751 |
| 12 | EEE | 2013J | 1535953 |
| 13 | EEE | 2014B | 832038 |
| 14 | EEE | 2014J | 1616236 |
| 15 | FFF | 2013B | 4220055 |
| 16 | FFF | 2013J | 5116439 |
| 17 | FFF | 2014B | 2975619 |
| 18 | FFF | 2014J | 5281809 |
| 19 | GGG | 2013J | 509091 |
| 20 | GGG | 2014B | 425165 |
| 21 | GGG | 2014J | 399628 |
sns.barplot(x='code_module', y='sum_click', hue='code_presentation', data=clicks_df)
<Axes: xlabel='code_module', ylabel='sum_click'>
in FFF course have most click in 2014j
TASK2: As a follow up to the first task, identify the courses in which the total number of clicks is higher in 2014 than 2013. If the course was taught two times in the same year (such as, 2013B and 2013J) use the average of both semesters ((2013B+2013J)/2) to compare with the other year.
clicks.index.name = None
clicks.index.name = None
clicks_pivot = clicks.unstack(level='code_presentation')
clicks_pivot = clicks_pivot.fillna(0)
print(clicks_pivot)
code_presentation 2013B 2013J 2014B 2014J code_module AAA 0.0 648494.0 0.0 598158.0 BBB 1347911.0 1378656.0 833865.0 1673744.0 CCC 0.0 0.0 1889177.0 2792972.0 DDD 1387467.0 1757214.0 947657.0 1437751.0 EEE 0.0 1535953.0 832038.0 1616236.0 FFF 4220055.0 5116439.0 2975619.0 5281809.0 GGG 0.0 509091.0 425165.0 399628.0
result_df = pd.DataFrame(index=clicks_pivot.index)
both_semesters_mask_2013 = (clicks_pivot['2013B'] > 0) & (clicks_pivot['2013J'] > 0)
total_clicks_2013 = clicks_pivot[['2013B', '2013J']].sum(axis=1)
total_clicks_2013[both_semesters_mask_2013] /= 2
both_semesters_mask_2014 = (clicks_pivot['2014B'] > 0) & (clicks_pivot['2014J'] > 0)
total_clicks_2014 = clicks_pivot[['2014B', '2014J']].sum(axis=1)
total_clicks_2014[both_semesters_mask_2014] /= 2
result_df['2013'] = total_clicks_2013
result_df['2014'] = total_clicks_2014
result_df['Higher_in_2014'] = total_clicks_2014 > total_clicks_2013
print(result_df)
2013 2014 Higher_in_2014 code_module AAA 648494.0 598158.0 False BBB 1363283.5 1253804.5 False CCC 0.0 2341074.5 True DDD 1572340.5 1192704.0 False EEE 1535953.0 1224137.0 False FFF 4668247.0 4128714.0 False GGG 509091.0 412396.5 False
except CCC course 2013 clicks is higher
TASK3: Which type of resources were mostly clicked by the students? Do you observe a common pattern accross courses (e.g., in almost all courses, clicks on resource is higher than quiz)? A heatmap as a visualization might be helpful here.
course_comp2 = pd.read_csv("moodle.csv")
course_comp2 = pd.DataFrame(course_comp2)
course_comp2.head()
| id_site | code_module | code_presentation | activity_type | week_from | week_to | |
|---|---|---|---|---|---|---|
| 0 | 546943 | AAA | 2013J | resource | NaN | NaN |
| 1 | 546712 | AAA | 2013J | oucontent | NaN | NaN |
| 2 | 546998 | AAA | 2013J | resource | NaN | NaN |
| 3 | 546888 | AAA | 2013J | url | NaN | NaN |
| 4 | 547035 | AAA | 2013J | resource | NaN | NaN |
studac_data2 = pd.read_csv("studentMoodleInteract.csv")
merge_data2 = course_comp2.merge(studac_data2.groupby(['id_site', 'code_module'])[['sum_click']].sum().reset_index(),
how='inner', on=['code_module', 'id_site'])
merge_data2.head()
| id_site | code_module | code_presentation | activity_type | week_from | week_to | sum_click | |
|---|---|---|---|---|---|---|---|
| 0 | 546943 | AAA | 2013J | resource | NaN | NaN | 36 |
| 1 | 546712 | AAA | 2013J | oucontent | NaN | NaN | 527 |
| 2 | 546998 | AAA | 2013J | resource | NaN | NaN | 6 |
| 3 | 546888 | AAA | 2013J | url | NaN | NaN | 15 |
| 4 | 547035 | AAA | 2013J | resource | NaN | NaN | 28 |
pivot_table2 = merge_data2.pivot_table(index='code_module', columns="activity_type",
values='sum_click', aggfunc='sum')
pivot_table2
| activity_type | dataplus | dualpane | externalquiz | folder | forumng | glossary | homepage | htmlactivity | oucollaborate | oucontent | ouelluminate | ouwiki | page | questionnaire | quiz | repeatactivity | resource | sharedsubpage | subpage | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| code_module | ||||||||||||||||||||
| AAA | 3398.0 | NaN | NaN | NaN | 337103.0 | 656.0 | 266720.0 | NaN | 498.0 | 523394.0 | NaN | NaN | NaN | NaN | NaN | NaN | 14924.0 | NaN | 71582.0 | 28377.0 |
| BBB | NaN | NaN | NaN | NaN | 2109720.0 | 9820.0 | 1136792.0 | NaN | 12556.0 | 921260.0 | 1433.0 | NaN | NaN | 6730.0 | 462995.0 | NaN | 213605.0 | 171.0 | 273600.0 | 85494.0 |
| CCC | NaN | NaN | NaN | NaN | 643094.0 | NaN | 756032.0 | NaN | 9822.0 | 516502.0 | NaN | NaN | 4188.0 | NaN | 2164110.0 | NaN | 217846.0 | NaN | 342574.0 | 27981.0 |
| DDD | NaN | NaN | 64292.0 | NaN | 1493375.0 | 63336.0 | 1433206.0 | NaN | 53390.0 | 686331.0 | 13565.0 | 275867.0 | 1536.0 | NaN | NaN | NaN | 301644.0 | NaN | 979982.0 | 163565.0 |
| EEE | NaN | 4871.0 | NaN | NaN | 792656.0 | NaN | 654525.0 | NaN | 6920.0 | 1514727.0 | NaN | 457939.0 | 879.0 | NaN | 322772.0 | NaN | 52423.0 | NaN | 83270.0 | 93245.0 |
| FFF | 44070.0 | 15845.0 | NaN | 5420.0 | 2358160.0 | 6142.0 | 2418335.0 | 9239.0 | 25788.0 | 6567216.0 | 24030.0 | 160706.0 | 57028.0 | 58034.0 | 3831026.0 | 9.0 | 237075.0 | NaN | 1607759.0 | 168040.0 |
| GGG | NaN | NaN | NaN | NaN | 239282.0 | 8008.0 | 283454.0 | NaN | NaN | 477373.0 | NaN | NaN | NaN | NaN | 200337.0 | NaN | 72615.0 | NaN | 52815.0 | NaN |
pivot_table2.plot(kind='bar').legend(bbox_to_anchor =(1, 1))
<matplotlib.legend.Legend at 0x790d6109db10>
When we look at all code_modules, we see that the highest values are generally quiz and oucontent. Quiz is larger in BBB, CCC, DDD modules, and oucontent is larger in other modules.
TASK4: For each student, compute the total number of clicks per each course component type (activity_type column in moodle.csv) separately for each course and semester. A simple representation of the expected table is provided below with fake data (note that in the given example columns and rows are incomplete).
| Student Id | code_module | code_presentation | Assignment | |
|---|---|---|---|---|
| 1234 | AAA | 2013J | 23 | 33 |
| 1234 | BBB | 2014B | 5 | 42 |
Note that, in this task you actually create some features that can be used for predictive modeling.
ac_mod = studac_data2.merge(course_comp2, how='outer', on=['id_site', 'code_module','code_presentation'])
ac_mod = ac_mod.groupby(['id_student', 'id_site', 'code_module', 'code_presentation','activity_type'])['sum_click'].sum()
ac_mod = ac_mod.reset_index()
ac_mod = ac_mod.pivot_table(columns='activity_type', index=['id_student','code_module', 'code_presentation'], values='sum_click', aggfunc='sum')
ac_mod.columns.name = None
ac_mod.index.name = None
ac_mod = ac_mod.reset_index().fillna(0)
ac_mod.head()
| id_student | code_module | code_presentation | dataplus | dualpane | externalquiz | folder | forumng | glossary | homepage | ... | ouelluminate | ouwiki | page | questionnaire | quiz | repeatactivity | resource | sharedsubpage | subpage | url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6516.0 | AAA | 2014J | 21.0 | 0.0 | 0.0 | 0.0 | 451.0 | 0.0 | 497.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 31.0 | 0.0 | 143.0 | 143.0 |
| 1 | 8462.0 | DDD | 2013J | 0.0 | 0.0 | 12.0 | 0.0 | 36.0 | 0.0 | 184.0 | ... | 0.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 70.0 | 0.0 | 227.0 | 23.0 |
| 2 | 8462.0 | DDD | 2014J | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 7.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 11391.0 | AAA | 2013J | 0.0 | 0.0 | 0.0 | 0.0 | 193.0 | 0.0 | 138.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 0.0 | 32.0 | 5.0 |
| 4 | 23629.0 | BBB | 2013B | 0.0 | 0.0 | 0.0 | 0.0 | 87.0 | 0.0 | 36.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 31.0 | 0.0 | 2.0 | 0.0 | 5.0 | 0.0 |
5 rows × 23 columns
TASK5: Using proper visualizations and statistical analysis, please explore if there is any relationship between students' course performance (final_result column in studentInfo.csv) and clicks on different resources.
studentInfo_data.head()
| code_module | code_presentation | id_student | gender | region | highest_education | imd_band | age_band | num_of_prev_attempts | studied_credits | disability | final_result | date_unregistration | date_registration | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAA | 2013J | 11391 | M | East Anglian Region | HE Qualification | 80-100% | 35<= | 0 | 240 | N | Pass | NaN | Very early birds |
| 1 | AAA | 2013J | 28400 | F | Scotland | HE Qualification | 20-40% | 35<= | 0 | 60 | N | Pass | NaN | in-time |
| 2 | AAA | 2013J | 30268 | F | North Western Region | A Level or Equivalent | 20-40% | 35<= | 0 | 60 | Y | Withdrawn | first half | Early birds |
| 3 | AAA | 2013J | 31604 | F | South East Region | A Level or Equivalent | 40-60% | 35<= | 0 | 60 | N | Pass | NaN | in-time |
| 4 | AAA | 2013J | 32885 | F | West Midlands Region | Lower Than A Level | 40-60% | 35> | 0 | 60 | N | Pass | NaN | Very early birds |
merged_data5 = ac_mod.merge(studentInfo_data.loc[:,['id_student', 'code_module', 'code_presentation','final_result']], how='inner', on=['id_student', 'code_module','code_presentation']).drop_duplicates()
merged_data5.head()
| id_student | code_module | code_presentation | dataplus | dualpane | externalquiz | folder | forumng | glossary | homepage | ... | ouwiki | page | questionnaire | quiz | repeatactivity | resource | sharedsubpage | subpage | url | final_result | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6516.0 | AAA | 2014J | 21.0 | 0.0 | 0.0 | 0.0 | 451.0 | 0.0 | 497.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 31.0 | 0.0 | 143.0 | 143.0 | Pass |
| 1 | 8462.0 | DDD | 2013J | 0.0 | 0.0 | 12.0 | 0.0 | 36.0 | 0.0 | 184.0 | ... | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 70.0 | 0.0 | 227.0 | 23.0 | Withdrawn |
| 5 | 8462.0 | DDD | 2014J | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 7.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Withdrawn |
| 9 | 11391.0 | AAA | 2013J | 0.0 | 0.0 | 0.0 | 0.0 | 193.0 | 0.0 | 138.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 0.0 | 32.0 | 5.0 | Pass |
| 10 | 23629.0 | BBB | 2013B | 0.0 | 0.0 | 0.0 | 0.0 | 87.0 | 0.0 | 36.0 | ... | 0.0 | 0.0 | 0.0 | 31.0 | 0.0 | 2.0 | 0.0 | 5.0 | 0.0 | Fail |
5 rows × 24 columns
resource_columns = ['dataplus', 'dualpane', 'externalquiz', 'folder', 'forumng', 'glossary', 'homepage',
'htmlactivity', 'oucollaborate', 'oucontent', 'ouelluminate', 'ouwiki',
'page', 'questionnaire', 'quiz', 'repeatactivity', 'resource',
'sharedsubpage', 'subpage', 'url']
melted222_data = merged_data5.melt(id_vars=['final_result'], value_vars=resource_columns, var_name='resource_type', value_name='clicks')
plt.figure(figsize=(15, 20))
for i, resource_type in enumerate(resource_columns, 1):
plt.subplot(7, 3, i)
sns.boxplot(x='final_result', y=resource_type, data=merged_data5)
according to box plot oucontent,quiz, homepage have relationship with final_result
In this section, you will build a machine learning model to predict students' final course outcome (final_result column in studentInfo.csv). That is, whether student is 'Pass', 'Withdrawn', 'Fail', or 'Distinction'. If you consider the number of students in some of these categories are too few, you can combine them into a new category.
In Section 1.2, you explored demographic data about students and tuned some categorical variables. From these categorical variables, please generate at least 10 dummy variables to be used for predictors in the machine learning model.
studentInfo_data.head()
| code_module | code_presentation | id_student | gender | region | highest_education | imd_band | age_band | num_of_prev_attempts | studied_credits | disability | final_result | date_unregistration | date_registration | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAA | 2013J | 11391 | M | East Anglian Region | HE Qualification | 80-100% | 35<= | 0 | 240 | N | Pass | NaN | Very early birds |
| 1 | AAA | 2013J | 28400 | F | Scotland | HE Qualification | 20-40% | 35<= | 0 | 60 | N | Pass | NaN | in-time |
| 2 | AAA | 2013J | 30268 | F | North Western Region | A Level or Equivalent | 20-40% | 35<= | 0 | 60 | Y | Withdrawn | first half | Early birds |
| 3 | AAA | 2013J | 31604 | F | South East Region | A Level or Equivalent | 40-60% | 35<= | 0 | 60 | N | Pass | NaN | in-time |
| 4 | AAA | 2013J | 32885 | F | West Midlands Region | Lower Than A Level | 40-60% | 35> | 0 | 60 | N | Pass | NaN | Very early birds |
mydummy = pd.get_dummies(studentInfo_data, columns=['gender', 'highest_education', 'region','imd_band', 'age_band', 'disability', 'date_registration'],drop_first=True)
mydummy.head()
| code_module | code_presentation | id_student | num_of_prev_attempts | studied_credits | final_result | date_unregistration | gender_M | highest_education_HE Qualification | highest_education_Lower Than A Level | ... | region_Yorkshire Region | imd_band_20-40% | imd_band_40-60% | imd_band_60-80% | imd_band_80-100% | age_band_35> | disability_Y | date_registration_Early birds | date_registration_in-time | date_registration_late-comers | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAA | 2013J | 11391 | 0 | 240 | Pass | NaN | True | True | False | ... | False | False | False | False | True | False | False | False | False | False |
| 1 | AAA | 2013J | 28400 | 0 | 60 | Pass | NaN | False | True | False | ... | False | True | False | False | False | False | False | False | True | False |
| 2 | AAA | 2013J | 30268 | 0 | 60 | Withdrawn | first half | False | False | False | ... | False | True | False | False | False | False | True | True | False | False |
| 3 | AAA | 2013J | 31604 | 0 | 60 | Pass | NaN | False | False | False | ... | False | False | True | False | False | False | False | False | True | False |
| 4 | AAA | 2013J | 32885 | 0 | 60 | Pass | NaN | False | False | True | ... | False | False | True | False | False | True | False | False | False | False |
5 rows × 31 columns
I convert to categorical columns to dummy variables
In Section 1.5, you have already created some features from students' click behaviour. You can use all of them here as additional predictors.
Additionally, you should create at least 3 features indicating the engagement level of students at different course components. Some example features are provided below :
There is no limit in the type and number of additional feature you can generate from the click data.
ac_mod = ac_mod.merge(studentInfo_data.loc[:,['id_student', 'code_module', 'code_presentation','final_result']], how='inner', on=['id_student', 'code_module','code_presentation']).drop_duplicates()
mydummy = mydummy.merge(ac_mod.iloc[:,:-1], how='inner', on=['id_student', 'code_module','code_presentation'])
mydummy.head()
| code_module | code_presentation | id_student | num_of_prev_attempts | studied_credits | final_result | date_unregistration | gender_M | highest_education_HE Qualification | highest_education_Lower Than A Level | ... | ouelluminate | ouwiki | page | questionnaire | quiz | repeatactivity | resource | sharedsubpage | subpage | url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAA | 2013J | 11391 | 0 | 240 | Pass | NaN | True | True | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 0.0 | 32.0 | 5.0 |
| 1 | AAA | 2013J | 28400 | 0 | 60 | Pass | NaN | False | True | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.0 | 0.0 | 87.0 | 48.0 |
| 2 | AAA | 2013J | 30268 | 0 | 60 | Withdrawn | first half | False | False | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 22.0 | 4.0 |
| 3 | AAA | 2013J | 31604 | 0 | 60 | Pass | NaN | False | False | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 19.0 | 0.0 | 144.0 | 90.0 |
| 4 | AAA | 2013J | 32885 | 0 | 60 | Pass | NaN | False | False | True | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 45.0 | 0.0 | 79.0 | 14.0 |
5 rows × 51 columns
mydummy.columns
Index(['code_module', 'code_presentation', 'id_student',
'num_of_prev_attempts', 'studied_credits', 'final_result',
'date_unregistration', 'gender_M', 'highest_education_HE Qualification',
'highest_education_Lower Than A Level', 'region_East Midlands Region',
'region_Ireland', 'region_London Region', 'region_North Region',
'region_North Western Region', 'region_Scotland',
'region_South East Region', 'region_South Region',
'region_South West Region', 'region_Wales',
'region_West Midlands Region', 'region_Yorkshire Region',
'imd_band_20-40%', 'imd_band_40-60%', 'imd_band_60-80%',
'imd_band_80-100%', 'age_band_35>', 'disability_Y',
'date_registration_Early birds', 'date_registration_in-time',
'date_registration_late-comers', 'dataplus', 'dualpane', 'externalquiz',
'folder', 'forumng', 'glossary', 'homepage', 'htmlactivity',
'oucollaborate', 'oucontent', 'ouelluminate', 'ouwiki', 'page',
'questionnaire', 'quiz', 'repeatactivity', 'resource', 'sharedsubpage',
'subpage', 'url'],
dtype='object')
mydummy['engagement_level_3types'] = mydummy[['dataplus', 'dualpane', 'externalquiz', 'folder', 'forumng',
'glossary', 'homepage', 'htmlactivity', 'oucollaborate',
'oucontent', 'ouelluminate', 'ouwiki', 'page', 'questionnaire',
'quiz', 'repeatactivity', 'resource', 'sharedsubpage', 'subpage',
'url']].gt(0).sum(axis=1) >= 3
print(mydummy['engagement_level_3types'].value_counts())
engagement_level_3types True 36793 False 1080 Name: count, dtype: int64
the column has two unique values (true and false), and the counts show how many students fall into each category. If true indicates that a student clicked at least three types of course components, then there are 1080 students in that category.
mydummy['avg_clicks_per_semester'] = mydummy[['dataplus', 'dualpane', 'externalquiz', 'folder', 'forumng',
'glossary', 'homepage', 'htmlactivity', 'oucollaborate',
'oucontent', 'ouelluminate', 'ouwiki', 'page', 'questionnaire',
'quiz', 'repeatactivity', 'resource', 'sharedsubpage', 'subpage',
'url']].mean(axis=1)
print(mydummy[['avg_clicks_per_semester']])
avg_clicks_per_semester 0 46.70 1 71.75 2 14.05 3 107.90 4 51.70 ... ... 37868 2.05 37869 44.65 37870 15.60 37871 13.75 37872 30.80 [37873 rows x 1 columns]
(avg_clicks_per_semester) representing the average number of clicks across all components for each student in a single course and semester.
print(mydummy['avg_clicks_per_semester'].describe())
count 37873.000000 mean 59.981657 std 80.353663 min 0.050000 25% 10.500000 50% 31.400000 75% 76.500000 max 1206.950000 Name: avg_clicks_per_semester, dtype: float64
Thisprovide information such as the mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for the avg_clicks_per_semester column.
mydummy['engagement_level_alltypes'] = mydummy[['dataplus', 'dualpane', 'externalquiz', 'folder', 'forumng',
'glossary', 'homepage', 'htmlactivity', 'oucollaborate',
'oucontent', 'ouelluminate', 'ouwiki', 'page', 'questionnaire',
'quiz', 'repeatactivity', 'resource', 'sharedsubpage', 'subpage',
'url']].gt(0).all(axis=1)
mydummy['engagement_level_alltypes'].unique()
array([False])
there is no student who clicked all components
mydummy.drop(columns='engagement_level_alltypes', inplace=True)
mydummy = pd.get_dummies(mydummy.drop(columns=['date_unregistration',
'id_student','dataplus', 'dualpane', 'externalquiz',
'folder', 'forumng', 'glossary', 'homepage', 'htmlactivity',
'oucollaborate', 'oucontent', 'ouelluminate', 'ouwiki',
'page','questionnaire', 'quiz', 'repeatactivity', 'resource',
'sharedsubpage','subpage', 'url']), columns=['code_module',
'code_presentation'], drop_first=True)
pd.set_option('display.max_columns', None)
mydummy.head()
| num_of_prev_attempts | studied_credits | final_result | gender_M | highest_education_HE Qualification | highest_education_Lower Than A Level | region_East Midlands Region | region_Ireland | region_London Region | region_North Region | region_North Western Region | region_Scotland | region_South East Region | region_South Region | region_South West Region | region_Wales | region_West Midlands Region | region_Yorkshire Region | imd_band_20-40% | imd_band_40-60% | imd_band_60-80% | imd_band_80-100% | age_band_35> | disability_Y | date_registration_Early birds | date_registration_in-time | date_registration_late-comers | engagement_level_3types | avg_clicks_per_semester | code_module_BBB | code_module_CCC | code_module_DDD | code_module_EEE | code_module_FFF | code_module_GGG | code_presentation_2013J | code_presentation_2014B | code_presentation_2014J | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 240 | Pass | True | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | True | 46.70 | False | False | False | False | False | False | True | False | False |
| 1 | 0 | 60 | Pass | False | True | False | False | False | False | False | False | True | False | False | False | False | False | False | True | False | False | False | False | False | False | True | False | True | 71.75 | False | False | False | False | False | False | True | False | False |
| 2 | 0 | 60 | Withdrawn | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | True | False | False | False | False | True | True | False | False | True | 14.05 | False | False | False | False | False | False | True | False | False |
| 3 | 0 | 60 | Pass | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | True | False | False | False | False | False | True | False | True | 107.90 | False | False | False | False | False | False | True | False | False |
| 4 | 0 | 60 | Pass | False | False | True | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | True | False | False | False | False | True | 51.70 | False | False | False | False | False | False | True | False | False |
mydummy['final_result'].unique()
array(['Pass', 'Withdrawn', 'Fail', 'Distinction'], dtype=object)
mydummy['engagement_level_3types'] = mydummy['engagement_level_3types'].astype(int)
d = {'Pass': 1, 'Withdrawn':2, 'Distinction':3, 'Fail':0}
mydummy['final_result'] = mydummy['final_result'].apply(lambda x: d.get(x))
mydummy.head()
| num_of_prev_attempts | studied_credits | final_result | gender_M | highest_education_HE Qualification | highest_education_Lower Than A Level | region_East Midlands Region | region_Ireland | region_London Region | region_North Region | region_North Western Region | region_Scotland | region_South East Region | region_South Region | region_South West Region | region_Wales | region_West Midlands Region | region_Yorkshire Region | imd_band_20-40% | imd_band_40-60% | imd_band_60-80% | imd_band_80-100% | age_band_35> | disability_Y | date_registration_Early birds | date_registration_in-time | date_registration_late-comers | engagement_level_3types | avg_clicks_per_semester | code_module_BBB | code_module_CCC | code_module_DDD | code_module_EEE | code_module_FFF | code_module_GGG | code_presentation_2013J | code_presentation_2014B | code_presentation_2014J | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 240 | 1 | True | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | 1 | 46.70 | False | False | False | False | False | False | True | False | False |
| 1 | 0 | 60 | 1 | False | True | False | False | False | False | False | False | True | False | False | False | False | False | False | True | False | False | False | False | False | False | True | False | 1 | 71.75 | False | False | False | False | False | False | True | False | False |
| 2 | 0 | 60 | 2 | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | True | False | False | False | False | True | True | False | False | 1 | 14.05 | False | False | False | False | False | False | True | False | False |
| 3 | 0 | 60 | 1 | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | True | False | False | False | False | False | True | False | 1 | 107.90 | False | False | False | False | False | False | True | False | False |
| 4 | 0 | 60 | 1 | False | False | True | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | True | False | False | False | False | 1 | 51.70 | False | False | False | False | False | False | True | False | False |
mydummy = mydummy[mydummy.final_result.apply(lambda x: x in [0,1])]
As the last activity in this project, you are expected to train and test a logistic regression model for predicting students' final course status. You should use 10-fold cross-validation.
Interpret the results based on confusion matrix and AUC scores. In your interpretation, please also mention the features with high predictive power and those with low predictive power.
Please note that the achieving low/high accuracy in the predictions has no importance for your grade.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
x_train, x_test, y_train, y_test = train_test_split(mydummy.
drop(columns='final_result'), mydummy.loc[:,'final_result'], shuffle=True,
train_size=.8)
model = LogisticRegression(max_iter=3000)
model.fit(X=x_train, y=y_train)
LogisticRegression(max_iter=3000)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LogisticRegression(max_iter=3000)
cross_val_score(model, x_test, y_test, cv=10)
array([0.80288462, 0.75480769, 0.78846154, 0.77163462, 0.76923077,
0.75721154, 0.77403846, 0.72836538, 0.78846154, 0.77884615])
y_pred = cross_val_predict(model, x_test, y_test, cv=10)
conf_mat = confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred, target_names=['failed', 'passed']))
precision recall f1-score support
failed 0.74 0.62 0.68 1611
passed 0.78 0.86 0.82 2549
accuracy 0.77 4160
macro avg 0.76 0.74 0.75 4160
weighted avg 0.77 0.77 0.77 4160
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred)
0.7443586767026127
for i,v in enumerate(model.coef_[0]):
print(f'Feature: {x_train.columns[i]:<40} Score: {v:.5f}')
Feature: num_of_prev_attempts Score: -0.31264 Feature: studied_credits Score: 0.00066 Feature: gender_M Score: 0.03422 Feature: highest_education_HE Qualification Score: 0.09359 Feature: highest_education_Lower Than A Level Score: -0.54654 Feature: region_East Midlands Region Score: 0.18754 Feature: region_Ireland Score: 0.10683 Feature: region_London Region Score: -0.15509 Feature: region_North Region Score: -0.07396 Feature: region_North Western Region Score: -0.19340 Feature: region_Scotland Score: -0.13420 Feature: region_South East Region Score: 0.19938 Feature: region_South Region Score: 0.14321 Feature: region_South West Region Score: 0.11194 Feature: region_Wales Score: -0.18568 Feature: region_West Midlands Region Score: 0.09697 Feature: region_Yorkshire Region Score: -0.04849 Feature: imd_band_20-40% Score: 0.22244 Feature: imd_band_40-60% Score: 0.31824 Feature: imd_band_60-80% Score: 0.35537 Feature: imd_band_80-100% Score: 0.52841 Feature: age_band_35> Score: 0.22388 Feature: disability_Y Score: -0.18569 Feature: date_registration_Early birds Score: -0.02331 Feature: date_registration_in-time Score: -0.05832 Feature: date_registration_late-comers Score: -0.08498 Feature: engagement_level_3types Score: 2.27893 Feature: avg_clicks_per_semester Score: 0.02883 Feature: code_module_BBB Score: 0.16872 Feature: code_module_CCC Score: -1.12619 Feature: code_module_DDD Score: -0.67909 Feature: code_module_EEE Score: -0.77596 Feature: code_module_FFF Score: -2.24652 Feature: code_module_GGG Score: 0.25861 Feature: code_presentation_2013J Score: 0.52654 Feature: code_presentation_2014B Score: 0.50116 Feature: code_presentation_2014J Score: 0.77358